Add retry decorator to tests that are vulnerable to transient service issues#421

Merged
timmarkhuff merged 7 commits into main from tim/add-retry-decorator on Apr 8, 2026
Conversation

@timmarkhuff (Contributor) commented Apr 7, 2026

Some tests in python-sdk are vulnerable to bad responses from the cloud service. For example, an image query might get a result of STILL_PROCESSING, which means the cloud didn't have an answer in time. This is a transient error and will almost always be resolved with a retry.

This PR adds a retry decorator to protect such tests. Any test that submits an image query and asserts anything about the result is now protected with this decorator.

I also found a few instances of tests that were not using our standard detector_name function for naming detectors. I fixed those too.
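The decorator itself isn't shown in this excerpt, so here is a minimal sketch of the pattern the PR describes: re-running a test a few times before letting a transient failure surface. The name `retry`, the parameters, and the example test are assumptions for illustration, not the actual python-sdk implementation.

```python
import functools
import time


def retry(max_attempts=3, delay_seconds=2, exceptions=(AssertionError,)):
    """Re-run a flaky test up to max_attempts times before failing.

    Hypothetical sketch of the pattern described in this PR; the real
    decorator in python-sdk may differ in name and behavior.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise  # out of retries: let the failure surface
                    time.sleep(delay_seconds)  # back off before retrying
        return wrapper
    return decorator


# Usage on a test that could transiently see STILL_PROCESSING:
@retry(max_attempts=3, delay_seconds=1)
def test_image_query_gets_answer():
    # submit an image query and assert on the result (omitted here)
    ...
```

With this in place, a single bad cloud response (e.g. a STILL_PROCESSING result) triggers a retry instead of failing the CI run outright.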

@timmarkhuff timmarkhuff requested a review from brandon-wada April 7, 2026 20:38
@timmarkhuff timmarkhuff changed the title Add retry decorator Add retry decorator to test functions that are vulnerable to transient service issues Apr 7, 2026
@timmarkhuff timmarkhuff changed the title Add retry decorator to test functions that are vulnerable to transient service issues Add retry decorator to tests that are vulnerable to transient service issues Apr 7, 2026
@brandon-wada (Collaborator) left a comment


I'm fine with trying this out.

I think the central worry with something like this is that it might hide real flaky issues. However, we're currently fairly comfortable attributing the flakiness we observe to irregular usage patterns when 12 copies of the same tests run in sync with each other via GHA. Given that, we're unlikely to be hiding issues relevant to any real usage, and we should still be able to observe real issues in our BE alerting.

@timmarkhuff (Contributor, Author) replied

> I'm fine with trying this out.
>
> I think the central worry with something like this is that it might hide real flaky issues. However, we're currently fairly comfortable attributing the flakiness we observe to irregular usage patterns when 12 copies of the same tests run in sync with each other via GHA. Given that, we're unlikely to be hiding issues relevant to any real usage, and we should still be able to observe real issues in our BE alerting.

Ideally, none of the SDK tests should be intended to uncover flakiness. Flakiness is better discovered through canary tests than through unit tests in python-sdk.

@timmarkhuff timmarkhuff merged commit 44f1732 into main Apr 8, 2026
15 of 16 checks passed